Phrasal Rank-Encoding: Exploiting Phrase Redundancy and Translational Relations for Phrase Table Compression

Author

  • Marcin Junczys-Dowmunt
Abstract

We describe Phrasal Rank-Encoding (PR-Enc), a novel method for the compression of word-aligned target language data in phrase tables as used in phrase-based SMT. This method reduces the redundancy in phrase tables, which is a direct effect of the phrase-based approach. Combining PR-Enc with Huffman coding reduces the size of an aggressively compressed phrase table by another 39 percent. Using this and other space-reduction methods in a new binary phrase table implementation, a size reduction of an order of magnitude is achieved compared to the Moses on-disk phrase table implementation. Concerning decoding speed, all variants of the new phrase table are faster than the Moses binary phrase table implementation, while the PR-Enc encoded variant outperforms all other methods.


Similar resources

A Phrase Table without Phrases: Rank Encoding for Better Phrase Table Compression

This paper describes the first steps towards a minimum-size phrase table implementation to be used for phrase-based statistical machine translation. The focus lies on the size reduction of target language data in a phrase table. Rank Encoding (REnc), a novel method for the compression of word-aligned target language in phrase tables is presented. Combined with Huffman coding a relative size red...

Augmenting Phrase Table by Employing Lexicons for Pivot-based SMT

Pivot language is employed as a way to solve the data sparseness problem in machine translation, especially when the data for a particular language pair does not exist. The combination of source-to-pivot and pivot-to-target translation models can induce a new translation model through the pivot language. However, the errors in two models may compound as noise, and still, the combined model may ...

Hallucinating Phrase Translations for Low Resource MT

We demonstrate that “hallucinating” phrasal translations can significantly improve the quality of machine translation in low resource conditions. Our hallucinated phrase tables consist of entries composed from multiple unigram translations drawn from the baseline phrase table and from translations that are induced from monolingual corpora. The hallucinated phrase table is very noisy. Its transl...

Mandarin Chinese Prosodic Phrase Grouping and Modeling—Method and Implications

One major feature of the prosody of Mandarin Chinese speech flow is prosodic phrase grouping [1, 2, and 3]. Phrasal and sentential intonations are governed by a prosody framework that structurally groups phrases into a larger/longer and identifiable unit. An overall prosody pattern of such phrase grouping with prosodic specifications is superimposed on phrase group. In turn, individual phrasal i...

Phrasal complexity and the occurrence of filled pauses in presentation speeches in Japanese

Filled pauses are ubiquitous in everyday speech. I investigated whether linguistic complexity of upcoming phrases affects filler rate at phrase boundaries in presentation speeches in Japanese. Filler rate at phrase boundaries increased monotonically with complexity of the following phrases. However, when the following phrase was composed of more than 11 Bunsetsu-phrases, the filler rate did not...

Journal:
  • Prague Bull. Math. Linguistics

Volume 98, Issue

Pages -

Publication date: 2012